DAOS-18976 rebuild: refine rebuild SCAN process #18289
Ticket title is 'Aurora rebuild failing with DER_HG / DER_SHUTDOWN'
Test stage Build RPM on Leap 15 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18289/1/execution/node/318/log
Test stage Test RPMs on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18289/1/execution/node/1159/log
Test stage Build RPM on Leap 15 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18289/2/execution/node/336/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18289/2/execution/node/335/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18289/2/execution/node/392/log
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18289/2/display/redirect
Test stage Test RPMs on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18289/2/execution/node/1143/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18289/2/display/redirect
Test stage Build RPM on Leap 15 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18289/3/execution/node/318/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18289/3/execution/node/309/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18289/3/execution/node/332/log
2b2479c to 0e86c5d
1. In rebuild_obj_scan_cb(), check rt_finishing so that rebuild_scan_leader can exit quickly when rebuild_tgt_fini() is waiting for the refcount to drop. Without this check, a stale scan_leader continues scanning all VOS objects indefinitely, blocking TLS cleanup and causing retries to fail with -DER_BUSY.
2. In rebuild_tgt_scan_handler(), fix a race window between rebuild_tgt_fini() -> rebuild_pool_tls_destroy() and rebuild_pool_tls_lookup().

Signed-off-by: Liang Zhen <[email protected]>
Co-authored-by: Xuezhao Liu <[email protected]>
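
For readers unfamiliar with the pattern in item 1, here is a minimal sketch of an iteration callback that checks a "finishing" flag and stops early. It is not the DAOS code: struct scan_tracker, scan_cb(), scan_all(), demo_scan() and the finish_at field are hypothetical names made up for illustration, and the real rebuild_obj_scan_cb() signature and return codes are not shown in this PR text.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the rebuild tracker; only the flag matters here. */
struct scan_tracker {
    bool finishing;  /* set by the teardown path while it waits for refs to drop */
    int  scanned;    /* number of objects the callback has processed */
    int  finish_at;  /* demo only: request teardown after N objects, -1 = never */
};

/* Per-object callback: a nonzero return asks the iterator to stop. */
static int scan_cb(struct scan_tracker *t, int obj)
{
    (void)obj;
    if (t->finishing)
        return 1;    /* teardown in progress: abort the scan early */
    t->scanned++;
    if (t->finish_at >= 0 && t->scanned == t->finish_at)
        t->finishing = true;  /* demo: simulate another path requesting teardown */
    return 0;
}

/* Minimal iterator: stops as soon as the callback returns nonzero. */
static int scan_all(struct scan_tracker *t, int nr_objs)
{
    for (int i = 0; i < nr_objs; i++) {
        if (scan_cb(t, i) != 0)
            break;
    }
    return t->scanned;
}

/* Runs one scan over nr_objs objects and returns how many were processed. */
static int demo_scan(int nr_objs, int finish_at)
{
    struct scan_tracker t = { .finishing = false, .scanned = 0,
                              .finish_at = finish_at };
    return scan_all(&t, nr_objs);
}
```

In this sketch, demo_scan(10, -1) walks all ten objects, while demo_scan(10, 3) stops after three because the flag is observed on the next callback. The point is only that the scanner gives up its work promptly once teardown has been requested, instead of iterating to the end.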
0e86c5d to eae35f3
Test stage Functional Hardware Medium completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18289/4/display/redirect
Co-authored-by: Xuezhao Liu <[email protected]>
Test stage Functional Hardware Medium completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18289/8/execution/node/900/log